The customer's 'MonthlyCharge' is one of the most important features for any company, since revenue depends directly on the monthly charges paid by customers. It can be considered the second most important variable in the churn data set (after the 'Churn' variable itself), and the most important continuous variable.
Multiple regression will be used to identify the factors and features that affect MonthlyCharge and to test the significance of each of them.
Obviously, the monthly charge of each customer depends most on the number and types of services provided, but some of these may be more important and more significant than others. The objective of this data analysis is to identify those features and test their interrelationships in order to build a multiple regression model that can predict the monthly charge of each customer from the available features or criteria, thus giving stakeholders the insight to mitigate the negative factors and reinforce the positive ones that affect the monthly charge, and ultimately the company's profit.
Python, the general-purpose, interpreted, object-oriented language, was selected because it supports many useful packages for creating linear models, along with Python libraries such as:
The monthly charge of each customer obviously depends most on the number and types of services provided, but some services may be more important and significant than others. Multiple regression is an appropriate technique to check the factors and features that affect MonthlyCharge and the significance of each of them, by modelling the relationship between multiple explanatory variables and the single dependent variable ('MonthlyCharge') and, more importantly, to predict the monthly charge of each customer based on the selected features.
#importing the needed libraries
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None) #modified Pandas options to fully display the large dataset when needed
## Note: displaying a full DataFrame of this size can be slow or memory-intensive.
import numpy as np
np.set_printoptions(precision=5, suppress=True)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import sklearn
from sklearn import linear_model
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
#Reading the Data file (As received) churn_clean.csv
df_raw= pd.read_csv('churn_clean.csv')
target='MonthlyCharge'
df_raw.head()
| | CaseOrder | Customer_id | Interaction | UID | City | State | County | Zip | Lat | Lng | Population | Area | TimeZone | Job | Children | Age | Income | Marital | Gender | Churn | Outage_sec_perweek | Email | Contacts | Yearly_equip_failure | Techie | Contract | Port_modem | Tablet | InternetService | Phone | Multiple | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | PaymentMethod | Tenure | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | K409198 | aa90260b-4141-4a24-8e36-b04ce1f4f77b | e885b299883d4f9fb18e39c75155d990 | Point Baker | AK | Prince of Wales-Hyder | 99927 | 56.25100 | -133.37571 | 38 | Urban | America/Sitka | Environmental health practitioner | 0 | 68 | 28561.99 | Widowed | Male | No | 7.978323 | 10 | 0 | 1 | No | One year | Yes | Yes | Fiber Optic | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Credit Card (automatic) | 6.795513 | 172.455519 | 904.536110 | 5 | 5 | 5 | 3 | 4 | 4 | 3 | 4 |
| 1 | 2 | S120509 | fb76459f-c047-4a9d-8af9-e0f7d4ac2524 | f2de8bef964785f41a2959829830fb8a | West Branch | MI | Ogemaw | 48661 | 44.32893 | -84.24080 | 10446 | Urban | America/Detroit | Programmer, multimedia | 1 | 27 | 21704.77 | Married | Female | Yes | 11.699080 | 12 | 0 | 1 | Yes | Month-to-month | No | Yes | Fiber Optic | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Bank Transfer(automatic) | 1.156681 | 242.632554 | 800.982766 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 |
| 2 | 3 | K191035 | 344d114c-3736-4be5-98f7-c72c281e2d35 | f1784cfa9f6d92ae816197eb175d3c71 | Yamhill | OR | Yamhill | 97148 | 45.35589 | -123.24657 | 3735 | Urban | America/Los_Angeles | Chief Financial Officer | 4 | 50 | 9609.57 | Widowed | Female | No | 10.752800 | 9 | 0 | 1 | Yes | Two Year | Yes | No | DSL | Yes | Yes | No | No | No | No | No | Yes | Yes | Credit Card (automatic) | 15.754144 | 159.947583 | 2054.706961 | 4 | 4 | 2 | 4 | 4 | 3 | 3 | 3 |
| 3 | 4 | D90850 | abfa2b40-2d43-4994-b15a-989b8c79e311 | dc8a365077241bb5cd5ccd305136b05e | Del Mar | CA | San Diego | 92014 | 32.96687 | -117.24798 | 13863 | Suburban | America/Los_Angeles | Solicitor | 1 | 48 | 18925.23 | Married | Male | No | 14.913540 | 15 | 2 | 0 | Yes | Two Year | No | No | DSL | Yes | No | Yes | No | No | No | Yes | No | Yes | Mailed Check | 17.087227 | 119.956840 | 2164.579412 | 4 | 4 | 4 | 2 | 5 | 4 | 3 | 3 |
| 4 | 5 | K662701 | 68a861fd-0d20-4e51-a587-8a90407ee574 | aabb64a116e83fdc4befc1fbab1663f9 | Needville | TX | Fort Bend | 77461 | 29.38012 | -95.80673 | 11352 | Suburban | America/Chicago | Medical illustrator | 0 | 83 | 40074.19 | Separated | Male | Yes | 8.147417 | 16 | 2 | 1 | No | Month-to-month | Yes | No | Fiber Optic | No | No | No | No | No | Yes | Yes | No | No | Mailed Check | 1.670972 | 149.948316 | 271.493436 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 5 |
df_raw.info() #Data Information of the Raw DataFrame (As received)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 50 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   CaseOrder             10000 non-null  int64
 1   Customer_id           10000 non-null  object
 2   Interaction           10000 non-null  object
 3   UID                   10000 non-null  object
 4   City                  10000 non-null  object
 5   State                 10000 non-null  object
 6   County                10000 non-null  object
 7   Zip                   10000 non-null  int64
 8   Lat                   10000 non-null  float64
 9   Lng                   10000 non-null  float64
 10  Population            10000 non-null  int64
 11  Area                  10000 non-null  object
 12  TimeZone              10000 non-null  object
 13  Job                   10000 non-null  object
 14  Children              10000 non-null  int64
 15  Age                   10000 non-null  int64
 16  Income                10000 non-null  float64
 17  Marital               10000 non-null  object
 18  Gender                10000 non-null  object
 19  Churn                 10000 non-null  object
 20  Outage_sec_perweek    10000 non-null  float64
 21  Email                 10000 non-null  int64
 22  Contacts              10000 non-null  int64
 23  Yearly_equip_failure  10000 non-null  int64
 24  Techie                10000 non-null  object
 25  Contract              10000 non-null  object
 26  Port_modem            10000 non-null  object
 27  Tablet                10000 non-null  object
 28  InternetService       10000 non-null  object
 29  Phone                 10000 non-null  object
 30  Multiple              10000 non-null  object
 31  OnlineSecurity        10000 non-null  object
 32  OnlineBackup          10000 non-null  object
 33  DeviceProtection      10000 non-null  object
 34  TechSupport           10000 non-null  object
 35  StreamingTV           10000 non-null  object
 36  StreamingMovies       10000 non-null  object
 37  PaperlessBilling      10000 non-null  object
 38  PaymentMethod         10000 non-null  object
 39  Tenure                10000 non-null  float64
 40  MonthlyCharge         10000 non-null  float64
 41  Bandwidth_GB_Year     10000 non-null  float64
 42  Item1                 10000 non-null  int64
 43  Item2                 10000 non-null  int64
 44  Item3                 10000 non-null  int64
 45  Item4                 10000 non-null  int64
 46  Item5                 10000 non-null  int64
 47  Item6                 10000 non-null  int64
 48  Item7                 10000 non-null  int64
 49  Item8                 10000 non-null  int64
dtypes: float64(7), int64(16), object(27)
memory usage: 3.8+ MB
raw_variables=['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State',
'County', 'Zip', 'Lat', 'Lng', 'Population', 'Area', 'TimeZone', 'Job',
'Children', 'Age', 'Income', 'Marital', 'Gender', 'Churn',
'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure',
'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService',
'Phone', 'Multiple', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'PaperlessBilling', 'PaymentMethod', 'Tenure', 'MonthlyCharge',
'Bandwidth_GB_Year', 'Item1', 'Item2', 'Item3', 'Item4', 'Item5',
'Item6', 'Item7', 'Item8']
num_variables = ['CaseOrder', 'Zip', 'Lat', 'Lng', 'Population', 'Children',
'Age','Income', 'Outage_sec_perweek', 'Email', 'Contacts',
'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',
'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8']
categ_variables = ['Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Area',
'TimeZone', 'Job', 'Marital', 'Gender', 'Churn', 'Techie', 'Contract',
'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod']
print('raw_variables:',len(raw_variables),' num_variables:',len(num_variables),' cat_variables:',len(categ_variables))
raw_variables: 50 num_variables: 23 cat_variables: 27
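The hand-typed numeric/categorical lists above can also be derived programmatically from the column dtypes with `select_dtypes`; a small sketch on a toy frame (`df_demo` is illustrative, not the churn data):

```python
# Deriving numeric vs. categorical column lists from dtypes (toy example).
import pandas as pd

df_demo = pd.DataFrame({'Age': [68, 27],
                        'Gender': ['Male', 'Female'],
                        'MonthlyCharge': [172.5, 242.6]})
num_cols = df_demo.select_dtypes(include='number').columns.tolist()
cat_cols = df_demo.select_dtypes(exclude='number').columns.tolist()
print(num_cols)  # → ['Age', 'MonthlyCharge']
print(cat_cols)  # → ['Gender']
```

Deriving the lists this way keeps them in sync if columns are later added or dropped.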
#Summary statistics of Numerical variables of the Raw DataFrame
df_raw.describe().T #Transposed to show large number of variables displayed as rows
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CaseOrder | 10000.0 | 5000.500000 | 2886.895680 | 1.000000 | 2500.750000 | 5000.500000 | 7500.250000 | 10000.000000 |
| Zip | 10000.0 | 49153.319600 | 27532.196108 | 601.000000 | 26292.500000 | 48869.500000 | 71866.500000 | 99929.000000 |
| Lat | 10000.0 | 38.757567 | 5.437389 | 17.966120 | 35.341828 | 39.395800 | 42.106908 | 70.640660 |
| Lng | 10000.0 | -90.782536 | 15.156142 | -171.688150 | -97.082813 | -87.918800 | -80.088745 | -65.667850 |
| Population | 10000.0 | 9756.562400 | 14432.698671 | 0.000000 | 738.000000 | 2910.500000 | 13168.000000 | 111850.000000 |
| Children | 10000.0 | 2.087700 | 2.147200 | 0.000000 | 0.000000 | 1.000000 | 3.000000 | 10.000000 |
| Age | 10000.0 | 53.078400 | 20.698882 | 18.000000 | 35.000000 | 53.000000 | 71.000000 | 89.000000 |
| Income | 10000.0 | 39806.926771 | 28199.916702 | 348.670000 | 19224.717500 | 33170.605000 | 53246.170000 | 258900.700000 |
| Outage_sec_perweek | 10000.0 | 10.001848 | 2.976019 | 0.099747 | 8.018214 | 10.018560 | 11.969485 | 21.207230 |
| Email | 10000.0 | 12.016000 | 3.025898 | 1.000000 | 10.000000 | 12.000000 | 14.000000 | 23.000000 |
| Contacts | 10000.0 | 0.994200 | 0.988466 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 7.000000 |
| Yearly_equip_failure | 10000.0 | 0.398000 | 0.635953 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 |
| Tenure | 10000.0 | 34.526188 | 26.443063 | 1.000259 | 7.917694 | 35.430507 | 61.479795 | 71.999280 |
| MonthlyCharge | 10000.0 | 172.624816 | 42.943094 | 79.978860 | 139.979239 | 167.484700 | 200.734725 | 290.160419 |
| Bandwidth_GB_Year | 10000.0 | 3392.341550 | 2185.294852 | 155.506715 | 1236.470827 | 3279.536903 | 5586.141369 | 7158.981530 |
| Item1 | 10000.0 | 3.490800 | 1.037797 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 7.000000 |
| Item2 | 10000.0 | 3.505100 | 1.034641 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 7.000000 |
| Item3 | 10000.0 | 3.487000 | 1.027977 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 |
| Item4 | 10000.0 | 3.497500 | 1.025816 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 7.000000 |
| Item5 | 10000.0 | 3.492900 | 1.024819 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 7.000000 |
| Item6 | 10000.0 | 3.497300 | 1.033586 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 |
| Item7 | 10000.0 | 3.509500 | 1.028502 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 7.000000 |
| Item8 | 10000.0 | 3.495600 | 1.028633 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 |
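The quartiles reported by describe() feed directly into the common 1.5×IQR outlier rule; a toy sketch (the values below are made up, not taken from the dataset):

```python
# Flagging potential outliers with the 1.5*IQR rule (toy data).
import pandas as pd

s = pd.Series([120.0, 150.0, 160.0, 170.0, 200.0, 450.0])  # 450 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # → [450.0]
```

The boxplots produced by plt_summary() below visualize exactly these fences.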
*Function plt_summary(): to inspect the data visually using histograms, boxplots, and scatter plots. The following function is defined to visually identify statistical characteristics and to get a sense of the data, such as identifying outliers, ranges, dominant values, etc.
'''Custom function to inspect the data visually using Histogram/Bar, Boxplot and scatter plots.
This function takes 3 parameters: 1st is the DataFrame, 2nd is the preferred plot type ("hist", "box", "scat", "target_scat" or "target_histplot"), 3rd is the target column name.
The target argument is used only for the target-vs-variables plots and is ignored for the other options.
For histograms of non-numeric columns, the function plots bar charts (in green) instead of histograms.
"scat": scatter plot of each numeric variable against the 1st column of the DataFrame.
"target_scat": scatter plot of each numeric variable against the 'target' column of the DataFrame.
"target_histplot": histplot of numeric (or countplot of categorical) variables against the 'target' column.
'''
def plt_summary(df_1,plt_type,target):
numerics = ['uint8','uint16','uint32','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
if plt_type =='target_histplot':
pltlist=df_1.columns
# for number,cols in enumerate(df_1.columns):
# if (len(df_1[cols].value_counts())>=20):
# pltlist.remove(cols)
print("Number of columns: "+str(len(pltlist))) #Printing the number of columns.
elif plt_type =='hist':
pltlist=df_1.columns
print("Number of columns: "+str(df_1.shape[1])) #Printing the number of columns.
else :
df_n = df_1.select_dtypes(include=numerics).copy() # selecting only the numeric columns as it's suitable for mentioned plots
pltlist=df_n.columns
print("Number of Numeric columns: "+str(df_n.shape[1])) #Printing the number of numeric columns.
plt.style.use('default')
plt.figure(figsize=(15,30))
plt_rows = 11
plt_cols = 5
for n, col in enumerate(pltlist): #enumerating each column of the data, starting from index
b=n+1
ax=plt.subplot2grid(shape=(plt_rows,plt_cols),loc=((int(np.ceil(b/plt_cols)-1)),((b-1)%plt_cols)))
if plt_type == 'hist':
if df_1[col].dtype in numerics:
ax.hist(df_1.loc[:,col],bins = 50) #plotting Histogram
ax.set_title(str(col)+' (*num)')
else:
if len(df_1[col].value_counts())>2000:
ax.set_title(str(col)+' ('+str(len(df_1[col].value_counts()))+' unq val.)')
continue
df_1.groupby(col).size().plot.bar(rot=0,color='g') #plotting Bar chart
ax.set_title(str(col)+' (*categ. '+str(len(df_1[col].value_counts()))+' cat.)')
if len(df_1[col].value_counts())>20:
ax.tick_params(bottom=False,labelbottom=False)
elif plt_type == 'box':
msk = ~np.isnan(df_n.loc[:,col])
ax.boxplot(df_n.loc[msk,col]) #plotting Box plot (used a mask to filter the null values)
#ax.boxplot(df.loc[:,col])
ax.set_title(col)
elif plt_type == 'scat':
ax.scatter(df_n.iloc[:,0],df_n.loc[:,col],s=0.1,alpha=0.1) #plotting scatter plot against the 1st column of the D.F.
ax.set_title(col)
elif plt_type == 'target_scat':
ax.scatter(df_n.loc[:,col],df_n.loc[:,target],s=1,alpha=0.1,color='b') #plotting scatter plot of num. variables against the 'target' column of the D.F.
#ax.set_title(col)
ax.set_xlabel(col)
elif plt_type == 'target_hexbin':
ax = df_n.plot.hexbin(ax=ax,x=col, y=target,gridsize=50, sharex=False) #plotting hexbin plot
ax.set_title(col)
elif (plt_type == 'target_histplot'):
if (df_1[target].dtype not in numerics):
ax.set_title(str(col)+' (*categ. '+str(len(df_1[col].value_counts()))+' cat.)')
if (len(df_1[col].value_counts())<=700):
sns.countplot(ax=ax, data = df_1,x = col,hue = target,alpha=0.7) #plotting countplot of cat. variables against the 'target' column
if len(df_1[col].value_counts())>20:
ax.tick_params(bottom=False,labelbottom=False)
elif (df_1[col].dtype in numerics):
sns.histplot(df_1, x=col, hue=target, element="poly",ax=ax) #plotting histplot of num. variables against the 'target' column
ax.set_title(str(col)+' (*num)')
# ax.plt.legend(fontsize='xx-small', title_fontsize='xx-small')
# plt.setp(ax.get_legend().get_texts(), fontsize='5') # for legend text
# plt.setp(ax.get_legend().get_title(), fontsize='32') # for legend title
elif df_1[target].dtype in numerics:
if (df_1[col].dtype not in numerics):
ax.set_title(str(col)+' (*categ. '+str(len(df_1[col].value_counts()))+' cat.)')
if (len(df_1[col].value_counts())<=5):
sns.histplot(df_1, x=target, hue=col, element="poly",ax=ax) #plotting histplot of num. variables against the 'target' column
elif (df_1[col].dtype in numerics):
ax = df_1.plot.hexbin(ax=ax,x=target, y=col,gridsize=30, sharex=False, cmap="cubehelix_r") #plotting hexbin plot
# continue
plt.tight_layout()
plt.show()
return
*Function cat2num(): to convert categorical variables into serial numeric values in integer format
'''
Custom function to convert categorical variables into numeric form.
This function takes 1 parameter, the input DataFrame, and creates a dictionary for each categorical variable
where the dictionary keys are the categories sorted alphabetically and the values are serial numeric codes.
The maximum number of categories this function can handle is 7000.
'''
def cat2num(df_1):
numerics = ['uint8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num_conv=df_1.copy()
for n, col in enumerate(df_num_conv.columns):
if (df_num_conv[col].dtype in numerics) or (len(df_num_conv[col].value_counts())>7000):
continue
else:
cat_dict={y:x for x,y in enumerate(dict(df_1[col].value_counts().sort_index(ascending=True)))} #map each category (sorted alphabetically) to a serial integer code
df_num_conv[col]=df_num_conv[col].replace(cat_dict)
return df_num_conv
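For reference, pandas can produce the same alphabetical category-to-integer mapping natively via categorical codes; a sketch on a toy Series (values are illustrative):

```python
# Built-in alternative to cat2num(): pandas categorical codes.
# For string categories, pd.Categorical sorts the categories alphabetically
# by default, matching the dictionary built in cat2num() above.
import pandas as pd

s = pd.Series(['DSL', 'Fiber Optic', 'None', 'DSL'])
codes = s.astype('category').cat.codes
print(codes.tolist())  # → [0, 1, 2, 0]  (DSL→0, Fiber Optic→1, None→2)
```

Unlike a plain replace, `.cat.codes` also keeps the original labels recoverable through the categorical dtype.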
# To be able to correlate with the categorical variables
df_num=cat2num(df_raw)
%%time
#calling the plt_summary to the Raw DataFrame
plt_summary(df_raw,"hist",target)
Number of columns: 50
Wall time: 38.1 s
%%time
#calling the plt_summary to the Raw DataFrame
plt_summary(df_raw,"target_histplot",target)
Number of columns: 50
Wall time: 22.3 s
%%time
#calling the plt_summary to the num DataFrame
plt_summary(df_num,'target_scat',target)
Number of Numeric columns: 47
Wall time: 13.5 s
*Function plot_corr_ellipses(): imported and modified from the textbook "Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python" (Bruce et al., 2020) and its associated GitHub code repository.
'''
From text Book "Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python"
The associated GitHub code repository
'''
from matplotlib.collections import EllipseCollection
from matplotlib.colors import Normalize
def plot_corr_ellipses(data, figsize=None, **kwargs):
''' https://stackoverflow.com/a/34558488 '''
M = np.array(data)
if not M.ndim == 2:
raise ValueError('data must be a 2D array')
fig, ax = plt.subplots(1, 1, figsize=figsize, subplot_kw={'aspect':'equal'})
ax.set_xlim(-0.5, M.shape[1] - 0.5)
ax.set_ylim(-0.5, M.shape[0] - 0.5)
ax.invert_yaxis()
# xy locations of each ellipse center
xy = np.indices(M.shape)[::-1].reshape(2, -1).T
# set the relative sizes of the major/minor axes according to the strength of
# the positive/negative correlation
w = np.ones_like(M).ravel() + 0.01
h = 1 - np.abs(M).ravel() - 0.01
a = 45 * np.sign(M).ravel()
ec = EllipseCollection(widths=w, heights=h, angles=a, units='x', offsets=xy,
norm=Normalize(vmin=-1, vmax=1),
transOffset=ax.transData, array=M.ravel(), **kwargs)
ax.add_collection(ec)
# if data is a DataFrame, use the row/column names as tick labels
if isinstance(data, pd.DataFrame):
ax.set_xticks(np.arange(M.shape[1]))
ax.set_xticklabels(data.columns, rotation=90)
ax.tick_params(axis="x", bottom=True, top=True, labelbottom=True, labeltop=True)
ax.set_yticks(np.arange(M.shape[0]))
ax.set_yticklabels(data.index)
ax.grid (True, color = "grey", linewidth = "0.5", linestyle = "-")
return ec
ec = plot_corr_ellipses(df_num.corr(), figsize=(12, 12), cmap='bwr_r') #returns an EllipseCollection (a mappable), usable for the colorbar
plt.colorbar(ec)
plt.tight_layout()
plt.show()
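The ellipse plot shows pairwise correlations; multicollinearity involving several predictors at once can be quantified with the variance_inflation_factor already imported above. A minimal sketch on synthetic columns (the names a, b, c are hypothetical, not churn predictors):

```python
# Quantifying multicollinearity with VIF on synthetic data.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame({'a': rng.normal(size=300)})
X['b'] = X['a'] * 0.9 + rng.normal(scale=0.1, size=300)  # nearly collinear with 'a'
X['c'] = rng.normal(size=300)                            # independent
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)  # 'a' and 'b' get large VIFs; 'c' stays near 1
```

A common rule of thumb treats VIF above 5-10 as a sign that a predictor is redundant with the others.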
#Principal Component Analysis
df_numpca =df_num.copy()
#Dropping target column
target = 'MonthlyCharge'
df_numpca.drop(columns=target,inplace=True)
#Dropping irrelevant columns
drop_cols = ['CaseOrder','City','State','County','Zip','Lng','Lat','Population','Area','TimeZone']
df_numpca.drop(columns=drop_cols,inplace=True)
numerics = ['uint8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_numpca = df_numpca.select_dtypes(include=numerics)
#Principal Component Analysis and plot
df_numpca_normalized=(df_numpca-df_numpca.mean())/df_numpca.std()
pca= PCA(n_components=df_numpca.shape[1])
pca.fit(df_numpca_normalized)
pca_names=['PC'+str(i) for i in range (1,len(df_numpca.columns)+1)]
df_numpca_pca=pd.DataFrame(pca.transform(df_numpca_normalized), columns = [*pca_names])
plt.plot(pca.explained_variance_ratio_,'o-')
plt.title('PCA graph')
plt.xlabel('Components')
plt.ylabel('Explained variance')
plt.ylim([-0.01,0.16])
#plt.xlim([-1,40])
plt.show()
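A common follow-up to the scree plot is choosing the number of components that reach a target share of explained variance; a sketch on synthetic standardized data (the 90% threshold is an illustrative choice, not one taken from this analysis):

```python
# Selecting the number of PCA components by cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.3, size=500)  # inject shared variance
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # standardize, as above

pca_demo = PCA().fit(X)
cum = np.cumsum(pca_demo.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum, 0.90) + 1)  # components needed for 90% variance
print(n_keep, cum[:n_keep])
```

The same cumulative curve can be read off the scree plot above by summing the plotted ratios.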
#Eigenvalues and plot
cov_matrix= np.dot(df_numpca_normalized.T,df_numpca_normalized)/df_numpca.shape[0]
eigenvalues =[np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)) for eigenvector in pca.components_]
plt.plot(eigenvalues,'o-')
plt.xlabel('number of components')
plt.ylabel('eigenvalue')
plt.show()
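Note that the hand-computed covariance above divides by n, while scikit-learn's PCA.explained_variance_ divides by n-1, so the eigenvalues agree only up to that factor. A small check on synthetic centered data:

```python
# Verifying that eigenvalues from the manual covariance (divided by n)
# match sklearn's explained_variance_ (divided by n-1) up to n/(n-1).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
Z = rng.normal(size=(200, 4))
Z = Z - Z.mean(axis=0)  # center, as PCA does internally

pca_chk = PCA().fit(Z)
cov_n = Z.T @ Z / Z.shape[0]  # divides by n, as in the cell above
eig_n = np.array([v @ cov_n @ v for v in pca_chk.components_])
print(np.allclose(eig_n * Z.shape[0] / (Z.shape[0] - 1),
                  pca_chk.explained_variance_))  # → True
```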
#Loadings
loadings = pd.DataFrame(pca.components_.T,columns = [*pca_names],index=df_numpca.columns)
pca_xplots = [(0,1),(0,2),(0,3),(0,4),(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
for j,k in (pca_xplots):
plt.figure(figsize=(5,5))
plt.scatter(df_numpca_pca.iloc[:,j],df_numpca_pca.iloc[:,k],s=10,alpha=0.4,c=df_num.loc[:,target],cmap='plasma')
plt.colorbar();plt.grid('on')
plt.title(pca_names[j] +" Vs "+ pca_names[k]+" - Colored by " + target );plt.xlabel(pca_names[j]);plt.ylabel(pca_names[k])
plt.ylim([-6,6]);plt.xlim([-4,4])
plt.show()
PCsortingcomponent=4
print('Sorted by: '+ pca_names[PCsortingcomponent-1])
display(loadings.abs().sort_values(by=pca_names[PCsortingcomponent-1],ascending =False))
print(pca_names[PCsortingcomponent-1] +' Main contributors sorted: \n\n',loadings.iloc[:,PCsortingcomponent-1].abs().sort_values(ascending =False))
#display(loadings.abs().sort_values(by=pca_names,ascending =False))
Sorted by: PC4
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | PC19 | PC20 | PC21 | PC22 | PC23 | PC24 | PC25 | PC26 | PC27 | PC28 | PC29 | PC30 | PC31 | PC32 | PC33 | PC34 | PC35 | PC36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StreamingMovies | 0.007581 | 0.082697 | 0.000262 | 0.491977 | 0.002869 | 0.188835 | 0.065912 | 0.274185 | 0.182115 | 0.132214 | 0.147521 | 0.009282 | 0.140368 | 0.235134 | 0.079190 | 0.020027 | 0.071876 | 0.265451 | 0.114765 | 0.004763 | 0.120719 | 0.223187 | 0.218039 | 0.034714 | 0.298826 | 0.063203 | 0.200321 | 0.155349 | 0.035846 | 0.002842 | 0.010444 | 0.007797 | 0.029515 | 0.344075 | 0.020737 | 0.032476 |
| Churn | 0.000537 | 0.472526 | 0.021086 | 0.453897 | 0.021638 | 0.020594 | 0.035237 | 0.013911 | 0.009681 | 0.012497 | 0.041883 | 0.016318 | 0.008013 | 0.017213 | 0.025577 | 0.000428 | 0.012875 | 0.000512 | 0.030711 | 0.013692 | 0.034469 | 0.002938 | 0.001223 | 0.025699 | 0.012477 | 0.012540 | 0.006859 | 0.005732 | 0.022036 | 0.010846 | 0.008233 | 0.002002 | 0.036000 | 0.745747 | 0.054355 | 0.005200 |
| StreamingTV | 0.000721 | 0.054804 | 0.021328 | 0.379004 | 0.048627 | 0.125987 | 0.033771 | 0.216248 | 0.072362 | 0.378668 | 0.135423 | 0.104623 | 0.447983 | 0.223634 | 0.212836 | 0.032315 | 0.095735 | 0.091239 | 0.203742 | 0.047028 | 0.045596 | 0.285874 | 0.125402 | 0.159588 | 0.153438 | 0.027862 | 0.090182 | 0.066222 | 0.007337 | 0.027787 | 0.008397 | 0.003142 | 0.006870 | 0.292614 | 0.029895 | 0.035836 |
| Contract | 0.001501 | 0.104656 | 0.012788 | 0.331955 | 0.073190 | 0.371125 | 0.124409 | 0.025272 | 0.225599 | 0.206919 | 0.200022 | 0.144769 | 0.170262 | 0.277016 | 0.087196 | 0.006943 | 0.082491 | 0.154111 | 0.136945 | 0.003033 | 0.003790 | 0.109729 | 0.469231 | 0.019258 | 0.155311 | 0.048724 | 0.228706 | 0.093797 | 0.041797 | 0.008134 | 0.000526 | 0.016576 | 0.001985 | 0.281734 | 0.020886 | 0.001311 |
| Bandwidth_GB_Year | 0.017122 | 0.604936 | 0.034122 | 0.298680 | 0.015519 | 0.009020 | 0.011998 | 0.023268 | 0.013211 | 0.003469 | 0.006766 | 0.007601 | 0.007482 | 0.013911 | 0.017268 | 0.000416 | 0.014116 | 0.022252 | 0.010705 | 0.007075 | 0.000616 | 0.001822 | 0.022782 | 0.028092 | 0.003208 | 0.001371 | 0.001875 | 0.005901 | 0.000781 | 0.000866 | 0.001610 | 0.005303 | 0.008873 | 0.201265 | 0.017201 | 0.705977 |
| InternetService | 0.005008 | 0.018604 | 0.004100 | 0.238989 | 0.143924 | 0.076694 | 0.050604 | 0.072805 | 0.369122 | 0.021535 | 0.088612 | 0.444106 | 0.011265 | 0.020008 | 0.148491 | 0.213540 | 0.193561 | 0.107531 | 0.109469 | 0.114315 | 0.307895 | 0.221301 | 0.024572 | 0.269113 | 0.309577 | 0.278854 | 0.047179 | 0.035404 | 0.103730 | 0.009465 | 0.003379 | 0.022876 | 0.006864 | 0.125455 | 0.001327 | 0.054198 |
| Tenure | 0.016509 | 0.616323 | 0.035459 | 0.229876 | 0.000942 | 0.044834 | 0.000889 | 0.004767 | 0.046761 | 0.006589 | 0.018133 | 0.050027 | 0.008858 | 0.011881 | 0.003175 | 0.009033 | 0.020721 | 0.028959 | 0.002482 | 0.014104 | 0.010960 | 0.005610 | 0.026192 | 0.037716 | 0.019731 | 0.010396 | 0.023171 | 0.008344 | 0.005511 | 0.001031 | 0.004420 | 0.004272 | 0.007559 | 0.241368 | 0.022441 | 0.703306 |
| Multiple | 0.000615 | 0.046412 | 0.012442 | 0.199469 | 0.086160 | 0.030945 | 0.039666 | 0.183407 | 0.366510 | 0.270889 | 0.042668 | 0.009783 | 0.346863 | 0.499341 | 0.103041 | 0.007182 | 0.139022 | 0.155688 | 0.029292 | 0.082243 | 0.035036 | 0.028059 | 0.116361 | 0.066653 | 0.098347 | 0.272628 | 0.250728 | 0.275807 | 0.060673 | 0.042335 | 0.018504 | 0.026514 | 0.023190 | 0.148478 | 0.032063 | 0.010555 |
| OnlineBackup | 0.005257 | 0.010424 | 0.019389 | 0.136434 | 0.196916 | 0.012535 | 0.212697 | 0.100905 | 0.036075 | 0.300307 | 0.246279 | 0.063623 | 0.042778 | 0.028872 | 0.225803 | 0.189235 | 0.199592 | 0.564801 | 0.050360 | 0.304277 | 0.166911 | 0.177622 | 0.170526 | 0.214864 | 0.026095 | 0.169207 | 0.111077 | 0.055572 | 0.002585 | 0.029259 | 0.003645 | 0.011469 | 0.011835 | 0.078259 | 0.006014 | 0.014577 |
| DeviceProtection | 0.002858 | 0.038429 | 0.017635 | 0.112216 | 0.252082 | 0.094312 | 0.349931 | 0.091473 | 0.134959 | 0.153213 | 0.369987 | 0.028751 | 0.060017 | 0.024028 | 0.252032 | 0.068464 | 0.233448 | 0.033021 | 0.013582 | 0.094980 | 0.345610 | 0.125131 | 0.149270 | 0.144687 | 0.263941 | 0.149408 | 0.379167 | 0.216004 | 0.037426 | 0.035508 | 0.022469 | 0.006581 | 0.017976 | 0.036233 | 0.002111 | 0.013490 |
| Techie | 0.008094 | 0.031940 | 0.024185 | 0.091791 | 0.230510 | 0.274604 | 0.144086 | 0.057471 | 0.052782 | 0.139601 | 0.208807 | 0.044136 | 0.295642 | 0.016449 | 0.235856 | 0.210325 | 0.333336 | 0.010948 | 0.451229 | 0.190112 | 0.028195 | 0.138340 | 0.261997 | 0.120165 | 0.037269 | 0.108367 | 0.261481 | 0.199204 | 0.046302 | 0.008148 | 0.002718 | 0.002826 | 0.003059 | 0.066916 | 0.006128 | 0.000221 |
| Phone | 0.005229 | 0.014443 | 0.035612 | 0.070828 | 0.205834 | 0.027407 | 0.213756 | 0.452829 | 0.095677 | 0.115040 | 0.157431 | 0.010012 | 0.146077 | 0.151349 | 0.086192 | 0.090444 | 0.303029 | 0.194542 | 0.146269 | 0.188821 | 0.115628 | 0.268290 | 0.277505 | 0.289896 | 0.289428 | 0.043911 | 0.219693 | 0.152332 | 0.020351 | 0.009957 | 0.013795 | 0.009608 | 0.023348 | 0.011655 | 0.026927 | 0.000192 |
| Gender | 0.006586 | 0.016673 | 0.006505 | 0.051472 | 0.175632 | 0.413557 | 0.090043 | 0.135460 | 0.007980 | 0.040148 | 0.145452 | 0.147354 | 0.059806 | 0.259977 | 0.148064 | 0.375396 | 0.314120 | 0.024359 | 0.162988 | 0.133769 | 0.251313 | 0.181319 | 0.058638 | 0.195369 | 0.326722 | 0.218066 | 0.203816 | 0.079514 | 0.016516 | 0.051113 | 0.037650 | 0.038639 | 0.021918 | 0.013969 | 0.005342 | 0.008385 |
| Outage_sec_perweek | 0.017846 | 0.000544 | 0.014935 | 0.044173 | 0.038536 | 0.382033 | 0.143953 | 0.000660 | 0.022539 | 0.019186 | 0.282489 | 0.356172 | 0.155071 | 0.097413 | 0.137551 | 0.255994 | 0.001346 | 0.247790 | 0.175769 | 0.148268 | 0.384654 | 0.058681 | 0.233870 | 0.070769 | 0.110579 | 0.313592 | 0.176450 | 0.158905 | 0.102102 | 0.015422 | 0.006803 | 0.013631 | 0.019338 | 0.003789 | 0.014285 | 0.000097 |
| Contacts | 0.009026 | 0.000290 | 0.010796 | 0.039768 | 0.182136 | 0.036704 | 0.378321 | 0.273111 | 0.169671 | 0.132915 | 0.116170 | 0.007508 | 0.180475 | 0.164197 | 0.247745 | 0.030843 | 0.509655 | 0.074997 | 0.001036 | 0.140032 | 0.146020 | 0.050714 | 0.053949 | 0.273092 | 0.057168 | 0.176322 | 0.079596 | 0.362993 | 0.053596 | 0.037978 | 0.002177 | 0.022794 | 0.018526 | 0.010191 | 0.002618 | 0.000259 |
| Children | 0.004048 | 0.014041 | 0.028746 | 0.035060 | 0.213805 | 0.285239 | 0.217991 | 0.183553 | 0.178764 | 0.113374 | 0.172381 | 0.218948 | 0.030042 | 0.068129 | 0.356657 | 0.140581 | 0.164610 | 0.046223 | 0.424081 | 0.205676 | 0.185524 | 0.035209 | 0.056108 | 0.045338 | 0.256907 | 0.034686 | 0.309492 | 0.238610 | 0.021585 | 0.014770 | 0.011588 | 0.016487 | 0.019149 | 0.019351 | 0.002844 | 0.021522 |
| PaperlessBilling | 0.005597 | 0.002064 | 0.007905 | 0.032583 | 0.298470 | 0.011140 | 0.187379 | 0.148047 | 0.321167 | 0.197850 | 0.076362 | 0.294153 | 0.149869 | 0.266052 | 0.108736 | 0.103878 | 0.205014 | 0.051534 | 0.009046 | 0.233636 | 0.149173 | 0.244783 | 0.110922 | 0.445697 | 0.084659 | 0.230053 | 0.153688 | 0.148411 | 0.024538 | 0.038843 | 0.031981 | 0.004057 | 0.013155 | 0.021229 | 0.010422 | 0.000413 |
| Age | 0.006554 | 0.003151 | 0.029252 | 0.031991 | 0.281359 | 0.199079 | 0.317906 | 0.223697 | 0.241218 | 0.035375 | 0.032454 | 0.065330 | 0.049535 | 0.058644 | 0.398310 | 0.057857 | 0.088120 | 0.115448 | 0.313296 | 0.094137 | 0.095193 | 0.184430 | 0.305925 | 0.138355 | 0.122598 | 0.108357 | 0.425755 | 0.010427 | 0.051827 | 0.001386 | 0.005858 | 0.006940 | 0.002985 | 0.003225 | 0.013610 | 0.021837 |
| Port_modem | 0.000962 | 0.004563 | 0.013163 | 0.031367 | 0.165757 | 0.222433 | 0.079278 | 0.062348 | 0.329119 | 0.108650 | 0.190705 | 0.097887 | 0.042661 | 0.050778 | 0.019677 | 0.222790 | 0.110425 | 0.384190 | 0.268489 | 0.602353 | 0.175339 | 0.159596 | 0.004923 | 0.089580 | 0.150200 | 0.002465 | 0.087251 | 0.039124 | 0.002351 | 0.006986 | 0.005048 | 0.013834 | 0.007564 | 0.013736 | 0.000279 | 0.000262 |
| Tablet | 0.016434 | 0.002069 | 0.000784 | 0.026981 | 0.192129 | 0.127834 | 0.249412 | 0.428005 | 0.140560 | 0.256156 | 0.050994 | 0.003623 | 0.246007 | 0.061726 | 0.130548 | 0.223382 | 0.002310 | 0.141636 | 0.021905 | 0.274867 | 0.232469 | 0.066469 | 0.181437 | 0.290499 | 0.221037 | 0.218813 | 0.096839 | 0.317336 | 0.015318 | 0.010816 | 0.004379 | 0.030316 | 0.029644 | 0.013243 | 0.007893 | 0.000142 |
| Job | 0.001049 | 0.001242 | 0.031493 | 0.022836 | 0.329722 | 0.086776 | 0.054761 | 0.148439 | 0.174482 | 0.094954 | 0.316065 | 0.091679 | 0.134239 | 0.247762 | 0.075445 | 0.365857 | 0.085401 | 0.120046 | 0.310262 | 0.066254 | 0.090188 | 0.441358 | 0.087704 | 0.200175 | 0.289308 | 0.038267 | 0.008945 | 0.171746 | 0.004539 | 0.043252 | 0.022277 | 0.010566 | 0.001547 | 0.013080 | 0.005546 | 0.000194 |
| Item8 | 0.308517 | 0.009701 | 0.131222 | 0.022130 | 0.044055 | 0.020209 | 0.023097 | 0.021846 | 0.009539 | 0.002297 | 0.007111 | 0.001344 | 0.013108 | 0.034616 | 0.049282 | 0.022911 | 0.011457 | 0.028014 | 0.014023 | 0.001732 | 0.030549 | 0.028959 | 0.084343 | 0.079474 | 0.078502 | 0.125008 | 0.017579 | 0.039042 | 0.905958 | 0.021946 | 0.109836 | 0.044369 | 0.043579 | 0.014673 | 0.043056 | 0.000205 |
| Income | 0.001292 | 0.001276 | 0.026129 | 0.021508 | 0.063252 | 0.236743 | 0.216279 | 0.008287 | 0.096790 | 0.022233 | 0.548237 | 0.142225 | 0.026822 | 0.129340 | 0.048227 | 0.313323 | 0.058301 | 0.303177 | 0.210185 | 0.151993 | 0.055583 | 0.171334 | 0.222607 | 0.036539 | 0.326140 | 0.049264 | 0.103442 | 0.264952 | 0.018032 | 0.084215 | 0.008163 | 0.000461 | 0.004042 | 0.007930 | 0.013115 | 0.000402 |
| PaymentMethod | 0.002023 | 0.004653 | 0.008694 | 0.016911 | 0.193016 | 0.013031 | 0.482436 | 0.005309 | 0.001050 | 0.213768 | 0.005256 | 0.146747 | 0.080588 | 0.035837 | 0.217943 | 0.239851 | 0.222119 | 0.229137 | 0.259887 | 0.073276 | 0.023716 | 0.222274 | 0.340821 | 0.303247 | 0.196091 | 0.084045 | 0.152228 | 0.214422 | 0.001924 | 0.017013 | 0.008718 | 0.001909 | 0.027474 | 0.025062 | 0.009451 | 0.000126 |
| Item4 | 0.145867 | 0.017667 | 0.568449 | 0.016728 | 0.009547 | 0.019590 | 0.000550 | 0.010518 | 0.000431 | 0.001171 | 0.002647 | 0.001226 | 0.008234 | 0.028154 | 0.004449 | 0.023287 | 0.011201 | 0.021615 | 0.006792 | 0.020525 | 0.031087 | 0.015592 | 0.021219 | 0.004305 | 0.022567 | 0.030135 | 0.056059 | 0.002538 | 0.168816 | 0.475272 | 0.437759 | 0.439274 | 0.079208 | 0.003252 | 0.020300 | 0.000141 |
| OnlineSecurity | 0.000666 | 0.015857 | 0.013712 | 0.016481 | 0.161994 | 0.293491 | 0.017085 | 0.146290 | 0.226535 | 0.339323 | 0.113024 | 0.338433 | 0.148980 | 0.027338 | 0.389467 | 0.059300 | 0.039427 | 0.243271 | 0.134325 | 0.069902 | 0.090809 | 0.088720 | 0.148535 | 0.012312 | 0.090009 | 0.305908 | 0.257477 | 0.315729 | 0.032739 | 0.045670 | 0.008865 | 0.010665 | 0.007603 | 0.010017 | 0.003164 | 0.011343 |
| Marital | 0.001546 | 0.000693 | 0.046799 | 0.014349 | 0.347844 | 0.076407 | 0.171580 | 0.141735 | 0.125403 | 0.126416 | 0.161881 | 0.051797 | 0.239772 | 0.272756 | 0.034559 | 0.403500 | 0.221978 | 0.003933 | 0.054662 | 0.208735 | 0.145218 | 0.304789 | 0.270860 | 0.351637 | 0.006516 | 0.203486 | 0.057540 | 0.047488 | 0.093998 | 0.005813 | 0.013080 | 0.025475 | 0.020917 | 0.026662 | 0.007244 | 0.000384 |
| Item2 | 0.433646 | 0.026643 | 0.282883 | 0.014344 | 0.019644 | 0.009868 | 0.009631 | 0.028512 | 0.008858 | 0.007923 | 0.004649 | 0.012385 | 0.005099 | 0.024432 | 0.003536 | 0.022874 | 0.000507 | 0.004931 | 0.003172 | 0.002269 | 0.008633 | 0.023442 | 0.004420 | 0.002462 | 0.025413 | 0.000172 | 0.016986 | 0.000871 | 0.115918 | 0.165363 | 0.062985 | 0.065290 | 0.589667 | 0.069944 | 0.570012 | 0.000323 |
| Item7 | 0.358071 | 0.007741 | 0.179488 | 0.012422 | 0.000581 | 0.030099 | 0.005011 | 0.000029 | 0.026424 | 0.005032 | 0.012128 | 0.011651 | 0.028360 | 0.030429 | 0.070849 | 0.025301 | 0.007576 | 0.052266 | 0.031587 | 0.001704 | 0.000793 | 0.005011 | 0.004932 | 0.052925 | 0.063855 | 0.005144 | 0.025634 | 0.002335 | 0.161129 | 0.799337 | 0.377178 | 0.073540 | 0.065202 | 0.009135 | 0.041380 | 0.000220 |
| Item1 | 0.458525 | 0.019237 | 0.280150 | 0.012252 | 0.013760 | 0.011502 | 0.011157 | 0.005069 | 0.010260 | 0.017610 | 0.002120 | 0.001586 | 0.008299 | 0.010941 | 0.011622 | 0.024066 | 0.011992 | 0.024512 | 0.013010 | 0.008374 | 0.011067 | 0.036225 | 0.009558 | 0.005490 | 0.004189 | 0.028611 | 0.008033 | 0.017221 | 0.064634 | 0.112616 | 0.041930 | 0.024621 | 0.242076 | 0.048698 | 0.790580 | 0.001637 |
| Item5 | 0.175640 | 0.035229 | 0.586517 | 0.008998 | 0.041244 | 0.014361 | 0.010206 | 0.015057 | 0.002980 | 0.016682 | 0.002886 | 0.009036 | 0.004974 | 0.000170 | 0.024448 | 0.011800 | 0.034503 | 0.000754 | 0.013238 | 0.008590 | 0.012133 | 0.029038 | 0.026569 | 0.002991 | 0.020815 | 0.031205 | 0.008809 | 0.011571 | 0.139159 | 0.061839 | 0.198011 | 0.697660 | 0.254121 | 0.007585 | 0.043864 | 0.000407 |
| Item6 | 0.404585 | 0.002234 | 0.182831 | 0.008509 | 0.003182 | 0.014235 | 0.008401 | 0.016793 | 0.012646 | 0.016567 | 0.011261 | 0.015698 | 0.010036 | 0.021253 | 0.020205 | 0.033475 | 0.010410 | 0.001307 | 0.009244 | 0.016679 | 0.011592 | 0.032672 | 0.020003 | 0.002218 | 0.007493 | 0.010820 | 0.010365 | 0.019902 | 0.064670 | 0.061767 | 0.759665 | 0.393332 | 0.229075 | 0.001731 | 0.065713 | 0.000282 |
| TechSupport | 0.024768 | 0.005585 | 0.008209 | 0.007814 | 0.139912 | 0.223760 | 0.014600 | 0.222772 | 0.180581 | 0.155007 | 0.033731 | 0.413021 | 0.263834 | 0.091080 | 0.280394 | 0.190067 | 0.138327 | 0.082896 | 0.071615 | 0.175369 | 0.438049 | 0.101027 | 0.096364 | 0.162935 | 0.224837 | 0.131862 | 0.154490 | 0.252284 | 0.039775 | 0.011047 | 0.028672 | 0.003537 | 0.001206 | 0.027241 | 0.013359 | 0.001368 |
| Yearly_equip_failure | 0.007599 | 0.016617 | 0.006955 | 0.006103 | 0.035001 | 0.025663 | 0.023875 | 0.075558 | 0.220335 | 0.439591 | 0.035618 | 0.300852 | 0.331503 | 0.334064 | 0.007379 | 0.149661 | 0.145620 | 0.038320 | 0.178172 | 0.036177 | 0.259980 | 0.265433 | 0.037514 | 0.030208 | 0.120815 | 0.405611 | 0.121979 | 0.149325 | 0.023268 | 0.010694 | 0.016649 | 0.002414 | 0.007886 | 0.011367 | 0.019725 | 0.000039 |
| Email | 0.008849 | 0.018048 | 0.006949 | 0.002488 | 0.314679 | 0.044164 | 0.033717 | 0.322053 | 0.257515 | 0.144767 | 0.097793 | 0.184984 | 0.285904 | 0.273286 | 0.127237 | 0.020267 | 0.153369 | 0.125704 | 0.092921 | 0.293498 | 0.196155 | 0.210775 | 0.110106 | 0.061643 | 0.073615 | 0.372168 | 0.131436 | 0.303329 | 0.050218 | 0.018361 | 0.017299 | 0.007160 | 0.016786 | 0.001767 | 0.001135 | 0.000047 |
| Item3 | 0.400185 | 0.024294 | 0.280546 | 0.000229 | 0.019831 | 0.013126 | 0.010342 | 0.023312 | 0.003661 | 0.007038 | 0.018317 | 0.025529 | 0.005056 | 0.022441 | 0.020544 | 0.011477 | 0.003578 | 0.045683 | 0.025014 | 0.009629 | 0.010890 | 0.006915 | 0.026626 | 0.003853 | 0.021886 | 0.017654 | 0.010491 | 0.004636 | 0.170711 | 0.252520 | 0.160266 | 0.383227 | 0.675249 | 0.013493 | 0.175998 | 0.000849 |
PC4 Main contributors sorted:
StreamingMovies         0.491977
Churn                   0.453897
StreamingTV             0.379004
Contract                0.331955
Bandwidth_GB_Year       0.298680
InternetService         0.238989
Tenure                  0.229876
Multiple                0.199469
OnlineBackup            0.136434
DeviceProtection        0.112216
Techie                  0.091791
Phone                   0.070828
Gender                  0.051472
Outage_sec_perweek      0.044173
Contacts                0.039768
Children                0.035060
PaperlessBilling        0.032583
Age                     0.031991
Port_modem              0.031367
Tablet                  0.026981
Job                     0.022836
Item8                   0.022130
Income                  0.021508
PaymentMethod           0.016911
Item4                   0.016728
OnlineSecurity          0.016481
Marital                 0.014349
Item2                   0.014344
Item7                   0.012422
Item1                   0.012252
Item5                   0.008998
Item6                   0.008509
TechSupport             0.007814
Yearly_equip_failure    0.006103
Email                   0.002488
Item3                   0.000229
Name: PC4, dtype: float64
Data preparation goals:
# selecting only the numeric columns
numerics = ['uint8','uint32','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num_only = df_raw.select_dtypes(include=numerics) # selecting only the numeric columns
Standardization of numeric variables
# Standardization of numeric variables
target='MonthlyCharge'
df_num_only_std=(df_num_only-df_num_only.mean())/df_num_only.std()
mlr_df=df_num_only_std.copy()
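As a cross-check on the standardization step above, the same z-scores can be reproduced with scikit-learn's StandardScaler; a minimal sketch on a toy frame (note that StandardScaler divides by the population standard deviation, ddof=0, while pandas' .std() uses the sample standard deviation, ddof=1, hence the rescaling):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})

# pandas-style z-score, as used above (sample std, ddof=1)
z_pandas = (df - df.mean()) / df.std()

# StandardScaler uses the population std (ddof=0);
# rescaling by sqrt((n-1)/n) recovers the pandas result
n = len(df)
z_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
z_rescaled = z_scaled * np.sqrt((n - 1) / n)

print(np.allclose(z_pandas, z_rescaled))  # True
```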
Initializing the test model
X=mlr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
X=X.drop([target],axis=1) #dropping the target column from the predictors data set
X = sm.add_constant(X) #adding constant that will represent the intercept of the MLR model
y = mlr_df[target] # assigning the target to vector y
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.30)
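Note that train_test_split without a fixed random_state produces a different split on every run, so the reported coefficients vary slightly between runs; a small sketch (toy arrays, illustrative only) showing how a fixed seed makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# same seed -> identical split on every run
Xtr1, Xte1, ytr1, yte1 = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
Xtr2, Xte2, ytr2, yte2 = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

print(np.array_equal(Xte1, Xte2))  # True
```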
%%time
# MLR implementation
linear_regression = sm.OLS(y_train,X_train)
fitted_model = linear_regression.fit()
residuals =fitted_model.predict(X_test)-y_test
print(fitted_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: MonthlyCharge R-squared: 0.924
Model: OLS Adj. R-squared: 0.924
Method: Least Squares F-statistic: 1840.
Date: Thu, 15 Apr 2021 Prob (F-statistic): 0.00
Time: 12:50:16 Log-Likelihood: -849.64
No. Observations: 7000 AIC: 1793.
Df Residuals: 6953 BIC: 2115.
Df Model: 46
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -2.2446 0.022 -100.875 0.000 -2.288 -2.201
CaseOrder 0.0039 0.006 0.655 0.512 -0.008 0.015
Zip 0.0012 0.010 0.119 0.906 -0.019 0.021
Lat -0.0040 0.004 -1.046 0.295 -0.011 0.003
Lng -2.722e-05 0.010 -0.003 0.998 -0.019 0.019
Population -0.0046 0.003 -1.344 0.179 -0.011 0.002
Children 0.1453 0.004 37.512 0.000 0.138 0.153
Age -0.1472 0.004 -37.829 0.000 -0.155 -0.140
Income -0.0008 0.003 -0.258 0.796 -0.007 0.006
Outage_sec_perweek -0.0009 0.003 -0.260 0.795 -0.007 0.006
Email -0.0035 0.003 -1.069 0.285 -0.010 0.003
Contacts -0.0026 0.003 -0.797 0.425 -0.009 0.004
Yearly_equip_failure -0.0031 0.003 -0.955 0.340 -0.009 0.003
Tenure 4.8029 0.067 71.527 0.000 4.671 4.935
Bandwidth_GB_Year -4.8257 0.067 -71.720 0.000 -4.958 -4.694
Item1 -0.0005 0.005 -0.105 0.916 -0.010 0.009
Item2 0.0095 0.005 2.075 0.038 0.001 0.018
Item3 -0.0038 0.004 -0.916 0.360 -0.012 0.004
Item4 0.0009 0.004 0.255 0.799 -0.006 0.008
Item5 -0.0007 0.004 -0.182 0.856 -0.008 0.007
Item6 -0.0004 0.004 -0.102 0.918 -0.008 0.007
Item7 -0.0027 0.004 -0.715 0.474 -0.010 0.005
Item8 -0.0017 0.004 -0.482 0.630 -0.009 0.005
City -2.82e-07 1.88e-06 -0.150 0.881 -3.97e-06 3.41e-06
State 4.368e-05 0.000 0.184 0.854 -0.000 0.001
County 1.155e-06 7.38e-06 0.157 0.876 -1.33e-05 1.56e-05
Area 0.0009 0.004 0.218 0.828 -0.007 0.009
TimeZone 0.0003 0.001 0.433 0.665 -0.001 0.002
Job -9.652e-06 1.78e-05 -0.541 0.588 -4.46e-05 2.53e-05
Marital 7.785e-05 0.002 0.034 0.973 -0.004 0.005
Gender 0.0994 0.006 15.992 0.000 0.087 0.112
Churn 0.0899 0.010 8.801 0.000 0.070 0.110
Techie 0.0158 0.009 1.792 0.073 -0.001 0.033
Contract 0.0142 0.004 3.443 0.001 0.006 0.022
Port_modem -0.0075 0.007 -1.147 0.251 -0.020 0.005
Tablet -0.0114 0.007 -1.585 0.113 -0.025 0.003
InternetService -0.5859 0.008 -70.021 0.000 -0.602 -0.569
Phone -0.0104 0.011 -0.912 0.362 -0.033 0.012
Multiple 0.9045 0.007 128.830 0.000 0.891 0.918
OnlineSecurity 0.2333 0.007 32.351 0.000 0.219 0.247
OnlineBackup 0.7263 0.007 101.007 0.000 0.712 0.740
DeviceProtection 0.4752 0.007 66.666 0.000 0.461 0.489
TechSupport 0.2950 0.007 43.510 0.000 0.282 0.308
StreamingTV 1.4650 0.010 150.658 0.000 1.446 1.484
StreamingMovies 1.6570 0.009 174.592 0.000 1.638 1.676
PaperlessBilling 0.0081 0.007 1.207 0.228 -0.005 0.021
PaymentMethod -0.0026 0.003 -0.857 0.392 -0.009 0.003
==============================================================================
Omnibus: 40109.708 Durbin-Watson: 1.982
Prob(Omnibus): 0.000 Jarque-Bera (JB): 681.182
Skew: -0.002 Prob(JB): 1.21e-148
Kurtosis: 1.472 Cond. No. 1.04e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Wall time: 143 ms
# MLR examination plots
residuals.hist(bins = 50)
plt.title('Residuals')
plt.xlabel('residual(std.)')
plt.show()
plt.scatter(X_test.index,residuals,s=3,alpha=0.4,c=y_test,cmap='plasma')
plt.title('Residuals positions (colored by "'+ target+'")');plt.xlabel('Test data idx');plt.ylabel('Test data residual (std.)')
plt.colorbar();plt.grid('on')
plt.show()
# QQ-Plot
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(residuals, plot=ax) # QQ-Plot
plt.title('Residuals normality')
plt.tight_layout()
plt.show()
fitted_model.pvalues.sort_values(ascending=True)
const                    0.000000e+00
InternetService          0.000000e+00
Multiple                 0.000000e+00
Bandwidth_GB_Year        0.000000e+00
OnlineBackup             0.000000e+00
DeviceProtection         0.000000e+00
Tenure                   0.000000e+00
StreamingMovies          0.000000e+00
TechSupport              0.000000e+00
StreamingTV              0.000000e+00
Age                     6.163214e-285
Children                1.268647e-280
OnlineSecurity          5.201959e-214
Gender                   1.478644e-56
Churn                    1.691906e-18
Contract                 5.786299e-04
Item2                    3.801865e-02
Techie                   7.318178e-02
Tablet                   1.129592e-01
Population               1.789391e-01
PaperlessBilling         2.276264e-01
Port_modem               2.514822e-01
Email                    2.850736e-01
Lat                      2.954272e-01
Yearly_equip_failure     3.396475e-01
Item3                    3.596141e-01
Phone                    3.618765e-01
PaymentMethod            3.917170e-01
Contacts                 4.254609e-01
Item7                    4.744922e-01
CaseOrder                5.121795e-01
Job                      5.883901e-01
Item8                    6.299214e-01
TimeZone                 6.648696e-01
Outage_sec_perweek       7.951892e-01
Income                   7.964818e-01
Item4                    7.989232e-01
Area                     8.277303e-01
State                    8.536626e-01
Item5                    8.557213e-01
County                   8.755916e-01
City                     8.809497e-01
Zip                      9.056529e-01
Item1                    9.162757e-01
Item6                    9.184061e-01
Marital                  9.731578e-01
Lng                      9.977253e-01
dtype: float64
signific=0.05
test_significant_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value <signific]
test_nonsign_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value >=signific]
print('test Significant columns: \n',test_significant_columns)
print('\n test Non Significant columns: \n',test_nonsign_columns)
test Significant columns:
['const', 'Children', 'Age', 'Tenure', 'Bandwidth_GB_Year', 'Item2', 'Gender', 'Churn', 'Contract', 'InternetService', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

test Non Significant columns:
['CaseOrder', 'Zip', 'Lat', 'Lng', 'Population', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure', 'Item1', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8', 'City', 'State', 'County', 'Area', 'TimeZone', 'Job', 'Marital', 'Techie', 'Port_modem', 'Tablet', 'Phone', 'PaperlessBilling', 'PaymentMethod']
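The list comprehensions above can also be expressed with pandas boolean indexing; a sketch using a toy Series as a stand-in for fitted_model.pvalues:

```python
import pandas as pd

signific = 0.05
# toy stand-in for fitted_model.pvalues
pvalues = pd.Series({'const': 0.0, 'Tenure': 0.001, 'Zip': 0.91, 'Lng': 0.99})

significant = pvalues.index[pvalues < signific].tolist()
nonsignificant = pvalues.index[pvalues >= signific].tolist()

print(significant)     # ['const', 'Tenure']
print(nonsignificant)  # ['Zip', 'Lng']
```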
X_temp=X.copy()
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_temp.values, i) for i in range(X_temp.values.shape[1])]
vif["features"] = X_temp.columns
print(vif.round(1).sort_values(by="VIF Factor",ascending=False))
    VIF Factor  features
14  422.2  Bandwidth_GB_Year
13  420.6  Tenure
0   45.8   const
2   9.4    Zip
4   7.9    Lng
36  3.5    InternetService
1   3.3    CaseOrder
43  2.2    StreamingTV
15  2.2    Item1
44  2.1    StreamingMovies
31  1.9    Churn
16  1.9    Item2
27  1.8    TimeZone
17  1.6    Item3
20  1.5    Item6
6   1.4    Children
19  1.4    Item5
7   1.4    Age
3   1.3    Lat
18  1.3    Item4
21  1.3    Item7
41  1.2    DeviceProtection
22  1.2    Item8
40  1.2    OnlineBackup
5   1.1    Population
30  1.1    Gender
38  1.1    Multiple
24  1.1    State
39  1.1    OnlineSecurity
33  1.1    Contract
37  1.0    Phone
42  1.0    TechSupport
35  1.0    Tablet
45  1.0    PaperlessBilling
23  1.0    City
34  1.0    Port_modem
32  1.0    Techie
29  1.0    Marital
28  1.0    Job
26  1.0    Area
25  1.0    County
12  1.0    Yearly_equip_failure
11  1.0    Contacts
10  1.0    Email
9   1.0    Outage_sec_perweek
8   1.0    Income
46  1.0    PaymentMethod
X_temp=X.copy()
droplist=['Bandwidth_GB_Year','Item2','Gender'] # list of features to remove based on VIF and modelling iterations
X_temp=X_temp.drop(droplist,axis=1)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_temp.values, i) for i in range(X_temp.values.shape[1])]
vif["features"] = X_temp.columns
print(vif.round(1).sort_values(by="VIF Factor",ascending=False))
    VIF Factor  features
0   41.6   const
2   9.4    Zip
4   7.9    Lng
13  3.7    Tenure
1   3.3    CaseOrder
28  1.9    Churn
14  1.8    Item1
25  1.8    TimeZone
18  1.5    Item6
15  1.5    Item3
17  1.4    Item5
19  1.3    Item7
3   1.3    Lat
16  1.3    Item4
20  1.2    Item8
41  1.2    StreamingMovies
30  1.1    Contract
40  1.1    StreamingTV
22  1.1    State
5   1.1    Population
12  1.0    Yearly_equip_failure
32  1.0    Tablet
42  1.0    PaperlessBilling
6   1.0    Children
39  1.0    TechSupport
38  1.0    DeviceProtection
37  1.0    OnlineBackup
36  1.0    OnlineSecurity
35  1.0    Multiple
34  1.0    Phone
33  1.0    InternetService
31  1.0    Port_modem
11  1.0    Contacts
7   1.0    Age
29  1.0    Techie
8   1.0    Income
27  1.0    Marital
26  1.0    Job
9   1.0    Outage_sec_perweek
24  1.0    Area
23  1.0    County
10  1.0    Email
21  1.0    City
43  1.0    PaymentMethod
test_nonsign_columns.extend(droplist)
test_significant_columns = [col for col in test_significant_columns if col not in test_nonsign_columns]
print('test Significant columns: \n',test_significant_columns)
print('\n test Non Significant columns: \n',test_nonsign_columns)
test Significant columns:
['const', 'Children', 'Age', 'Tenure', 'Churn', 'Contract', 'InternetService', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

test Non Significant columns:
['CaseOrder', 'Zip', 'Lat', 'Lng', 'Population', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure', 'Item1', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8', 'City', 'State', 'County', 'Area', 'TimeZone', 'Job', 'Marital', 'Techie', 'Port_modem', 'Tablet', 'Phone', 'PaperlessBilling', 'PaymentMethod', 'Bandwidth_GB_Year', 'Item2', 'Gender']
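The droplist above was assembled manually from the VIF table and modelling iterations; the same idea can be automated as an iterative VIF-pruning loop. A sketch on synthetic data, with an illustrative prune_by_vif helper and threshold (not part of the original analysis):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, threshold=10.0):
    """Drop the highest-VIF feature (ignoring the constant) until all VIFs <= threshold."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        candidates = vifs.drop('const', errors='ignore')
        worst = candidates.idxmax()
        if candidates[worst] <= threshold:
            return X
        X = X.drop(columns=[worst])

# synthetic demo: x2 is nearly a copy of x1, so one of the pair should be pruned
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({'const': 1.0, 'x1': x1,
                     'x2': x1 + rng.normal(scale=0.01, size=200),
                     'x3': rng.normal(size=200)})
pruned = prune_by_vif(demo)
print(sorted(pruned.columns))
```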
The reduced model is presented right in the next section.
%%time
# initializing the model
X=mlr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
X=X.drop(test_nonsign_columns,axis=1) #dropping non significant columns from the predictors data set
X=X.drop([target],axis=1) #dropping the target column from the predictors data set
X = sm.add_constant(X) #adding constant that will represent the intercept of the MLR model
y = mlr_df[target] # assigning the target to vector y
Wall time: 8.73 s
# Saving a copy of the prepared data set.
D208_t1_input=X.copy()
D208_t1_input['data-MonthlyCharge']=y
D208_t1_input.to_csv('D208_t1_model_input.csv', index = False)
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.30)
# MLR implementation
linear_regression = sm.OLS(y_train,X_train)
fitted_model = linear_regression.fit()
residuals =fitted_model.predict(X_test)-y_test
# MLR summary:
print(fitted_model.summary())
# MLR examination plots
residuals.hist(bins = 50)
plt.title('Residuals')
plt.xlabel('residual(std.)')
plt.show()
plt.scatter(X_test.index,residuals,s=3,alpha=0.4,c=y_test,cmap='plasma')
plt.title('Residuals positions (colored by "'+ target+'")');plt.xlabel('Test data idx');plt.ylabel('Test data residual (std.)')
plt.colorbar();plt.grid('on')
plt.show()
# QQ-Plot
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(residuals, plot=ax) # QQ-Plot
plt.title('Residuals normality')
plt.tight_layout()
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: MonthlyCharge R-squared: 0.869
Model: OLS Adj. R-squared: 0.869
Method: Least Squares F-statistic: 3580.
Date: Thu, 15 Apr 2021 Prob (F-statistic): 0.00
Time: 12:52:28 Log-Likelihood: -2840.6
No. Observations: 7000 AIC: 5709.
Df Residuals: 6986 BIC: 5805.
Df Model: 13
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const -1.8654 0.013 -145.295 0.000 -1.891 -1.840
Children 0.0005 0.004 0.113 0.910 -0.008 0.009
Age -0.0019 0.004 -0.439 0.661 -0.010 0.007
Tenure 0.0097 0.005 1.877 0.061 -0.000 0.020
Churn 0.0295 0.013 2.206 0.027 0.003 0.056
Contract 0.0066 0.006 1.194 0.232 -0.004 0.017
InternetService -0.0827 0.006 -13.917 0.000 -0.094 -0.071
Multiple 0.7689 0.009 86.674 0.000 0.752 0.786
OnlineSecurity 0.0706 0.009 7.787 0.000 0.053 0.088
OnlineBackup 0.5253 0.009 59.985 0.000 0.508 0.542
DeviceProtection 0.2962 0.009 33.746 0.000 0.279 0.313
TechSupport 0.2790 0.009 30.955 0.000 0.261 0.297
StreamingTV 0.9789 0.009 107.217 0.000 0.961 0.997
StreamingMovies 1.2121 0.009 129.793 0.000 1.194 1.230
==============================================================================
Omnibus: 2480.325 Durbin-Watson: 2.019
Prob(Omnibus): 0.000 Jarque-Bera (JB): 348.929
Skew: 0.095 Prob(JB): 1.70e-76
Kurtosis: 1.923 Cond. No. 7.04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Based mainly on the statistical significance (P>|t|) values, with a significance limit of 0.05: variables with a p-value below 0.05 lead to rejection of the null hypothesis, meaning these variables are significant and relevant to the research question (prediction of MonthlyCharge). Variables with a p-value of 0.05 or above represent a failure to reject the null hypothesis, meaning that assigning them a zero coefficient (i.e., excluding them from the model) is a defensible decision.
Note: The output should include a screenshot of each model.
signific=0.05
initial_significant_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value <signific]
initial_nonsign_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value >=signific]
print('initial Significant columns: \n',initial_significant_columns)
print('\n initial Non Significant columns: \n',initial_nonsign_columns)
initial Significant columns:
['const', 'Churn', 'InternetService', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

initial Non Significant columns:
['Children', 'Age', 'Tenure', 'Contract']
initial_nonsign_columns.extend(test_nonsign_columns)
print('initial Significant columns: \n',initial_significant_columns)
print('\n initial Non Significant columns: \n',initial_nonsign_columns)
initial Significant columns:
['const', 'Churn', 'InternetService', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

initial Non Significant columns:
['Children', 'Age', 'Tenure', 'Contract', 'CaseOrder', 'Zip', 'Lat', 'Lng', 'Population', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure', 'Item1', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8', 'City', 'State', 'County', 'Area', 'TimeZone', 'Job', 'Marital', 'Techie', 'Port_modem', 'Tablet', 'Phone', 'PaperlessBilling', 'PaymentMethod', 'Bandwidth_GB_Year', 'Item2', 'Gender']
%%time
# initializing the model
X=mlr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X = sm.add_constant(X) #adding constant that will represent the intercept of the MLR model
X=X[initial_significant_columns] #dropping nonsignificant columns from the predictors data set
#X=X.drop([target],axis=1) #dropping the target column from the predictors data set
# Adding dummy variables
dummy_cols = [dummy for dummy in categ_variables if dummy in X.columns]
X=pd.get_dummies(X,drop_first=True,columns=dummy_cols)
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
y = mlr_df[target] # assigning the target to vector y
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.30)
# MLR implementation
linear_regression = sm.OLS(y_train,X_train)
fitted_model = linear_regression.fit()
residuals =fitted_model.predict(X_test)-y_test
# MLR summary:
print(fitted_model.summary())
# MLR examination plots
residuals.hist(bins = 50)
plt.title('Residuals')
plt.xlabel('residual(std.)')
plt.show()
plt.scatter(X_test.index,residuals,s=3,alpha=0.4,c=y_test,cmap='plasma')
plt.title('Residuals positions (colored by "'+ target+'")');plt.xlabel('Test data idx');plt.ylabel('Test data residual (std.)')
plt.colorbar();plt.grid('on')
plt.show()
# QQ-Plot
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(residuals, plot=ax) # QQ-Plot
plt.title('Residuals normality')
plt.tight_layout()
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: MonthlyCharge R-squared: 0.958
Model: OLS Adj. R-squared: 0.958
Method: Least Squares F-statistic: 1.595e+04
Date: Thu, 15 Apr 2021 Prob (F-statistic): 0.00
Time: 12:53:03 Log-Likelihood: 1209.2
No. Observations: 7000 AIC: -2396.
Df Residuals: 6989 BIC: -2321.
Df Model: 10
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const -2.0696 0.007 -289.527 0.000 -2.084 -2.056
Churn_Yes 0.0492 0.006 8.129 0.000 0.037 0.061
InternetService_Fiber Optic 0.4735 0.006 85.137 0.000 0.463 0.484
InternetService_None -0.2967 0.007 -44.145 0.000 -0.310 -0.284
Multiple_Yes 0.7509 0.005 152.123 0.000 0.741 0.761
OnlineSecurity_Yes 0.0672 0.005 13.231 0.000 0.057 0.077
OnlineBackup_Yes 0.5242 0.005 106.860 0.000 0.515 0.534
DeviceProtection_Yes 0.2874 0.005 58.406 0.000 0.278 0.297
TechSupport_Yes 0.2857 0.005 56.998 0.000 0.276 0.295
StreamingTV_Yes 0.9697 0.005 192.304 0.000 0.960 0.980
StreamingMovies_Yes 1.2072 0.005 236.061 0.000 1.197 1.217
==============================================================================
Omnibus: 24927.818 Durbin-Watson: 1.967
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1110.724
Skew: 0.051 Prob(JB): 6.45e-242
Kurtosis: 1.051 Cond. No. 6.41
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Wall time: 1.66 s
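The drop_first=True encoding used above avoids the dummy-variable trap (perfect collinearity between a category's full set of dummies and the model's constant); a minimal illustration with the InternetService column:

```python
import pandas as pd

df_demo = pd.DataFrame({'InternetService': ['DSL', 'Fiber Optic', 'None', 'DSL']})

# drop_first=True drops the alphabetically first level ('DSL'),
# which becomes the baseline absorbed by the model's constant
dummies = pd.get_dummies(df_demo, drop_first=True, columns=['InternetService'])
print(list(dummies.columns))  # ['InternetService_Fiber Optic', 'InternetService_None']
```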
The model is then refitted on the unstandardized variables, for more interpretable results.
%%time
# No Standardization
mlr_df=df_num_only.copy()
# initializing the model
X=mlr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X = sm.add_constant(X) #adding constant that will represent the intercept of the MLR model
X=X[initial_significant_columns] #dropping nonsignificant columns from the predictors data set
#X=X.drop([target],axis=1) #dropping the target column from the predictors data set
# Adding dummy variables
dummy_cols = [dummy for dummy in categ_variables if dummy in X.columns]
X=pd.get_dummies(X,drop_first=True,columns=dummy_cols)
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
y = mlr_df[target] # assigning the target to vector y
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.30)
# MLR implementation
linear_regression = sm.OLS(y_train,X_train)
fitted_model = linear_regression.fit()
residuals =fitted_model.predict(X_test)-y_test
# MLR summary:
print(fitted_model.summary())
# MLR examination plots
residuals.hist(bins = 50)
plt.title('Residuals')
plt.xlabel('residual($)')
plt.show()
plt.scatter(X_test.index,residuals,s=3,alpha=0.4,c=y_test,cmap='plasma')
plt.title('Residuals positions (colored by "'+ target+'")');plt.xlabel('Test data idx');plt.ylabel('Test data residual ($)')
plt.colorbar();plt.grid('on')
plt.show()
# QQ-Plot
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(residuals, plot=ax) # QQ-Plot
plt.title('Residuals normality')
plt.tight_layout()
plt.show()
OLS Regression Results
==============================================================================
Dep. Variable: MonthlyCharge R-squared: 0.959
Model: OLS Adj. R-squared: 0.959
Method: Least Squares F-statistic: 1.618e+04
Date: Thu, 15 Apr 2021 Prob (F-statistic): 0.00
Time: 12:53:41 Log-Likelihood: -25108.
No. Observations: 7000 AIC: 5.024e+04
Df Residuals: 6989 BIC: 5.031e+04
Df Model: 10
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 83.8395 0.305 274.620 0.000 83.241 84.438
Churn_Yes 2.3076 0.261 8.853 0.000 1.797 2.819
InternetService_Fiber Optic 20.0667 0.238 84.291 0.000 19.600 20.533
InternetService_None -12.6781 0.292 -43.433 0.000 -13.250 -12.106
Multiple_Yes 32.3942 0.212 152.949 0.000 31.979 32.809
OnlineSecurity_Yes 2.5658 0.218 11.769 0.000 2.138 2.993
OnlineBackup_Yes 22.3806 0.210 106.328 0.000 21.968 22.793
DeviceProtection_Yes 12.4381 0.211 58.810 0.000 12.023 12.853
TechSupport_Yes 12.5581 0.217 57.911 0.000 12.133 12.983
StreamingTV_Yes 41.6414 0.216 192.774 0.000 41.218 42.065
StreamingMovies_Yes 51.7997 0.220 235.582 0.000 51.369 52.231
==============================================================================
Omnibus: 24973.130 Durbin-Watson: 1.942
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1105.922
Skew: 0.037 Prob(JB): 7.12e-241
Kurtosis: 1.054 Cond. No. 6.40
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Wall time: 1.96 s
The final set of variables was chosen through sequential iterations: running a model, eliminating the non-significant features, then rerunning, and so on until the final model was reached. Not all runs are included in this notebook; it contains only a test model, an initial model, and a final reduced model.
The models (initial and reduced) were validated and compared through:
residuals.hist(bins = 50)
plt.title('Residuals')
plt.xlabel('residual($)')
plt.show()
Typical normal distribution of a discretized variable
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(residuals, plot=ax) # QQ-Plot
plt.title('Residuals normality')
plt.tight_layout()
plt.show()
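Beyond the residual histogram and QQ-plot, the initial and reduced models can also be compared numerically on the test split; a sketch with an illustrative rmse helper on toy arrays standing in for y_test and the model predictions:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error of predictions against observations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy arrays standing in for y_test and fitted_model.predict(X_test)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(round(rmse(y_true, y_pred), 4))  # 0.1581
```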
Note: The output should include the predictions from the refined model you used to perform the analysis.
D208_t1_output=X.copy()
D208_t1_output['data-MonthlyCharge']=y
D208_t1_output['pred-MonthlyCharge']=fitted_model.predict(X)
D208_t1_output['residuals']=fitted_model.predict(X)-y
D208_t1_output.to_csv('D208_t1_model_output.csv', index = False)
Included in this notebook
# Final reg. model Equation:
equation = ''
for (feature,coef) in dict(fitted_model.params).items():
equation = equation+' + (' + feature +' * '+str(np.round(coef,3))+ ') '
print(target+' = \n (',equation[12:])
MonthlyCharge = ( 83.839) + (Churn_Yes * 2.308) + (InternetService_Fiber Optic * 20.067) + (InternetService_None * -12.678) + (Multiple_Yes * 32.394) + (OnlineSecurity_Yes * 2.566) + (OnlineBackup_Yes * 22.381) + (DeviceProtection_Yes * 12.438) + (TechSupport_Yes * 12.558) + (StreamingTV_Yes * 41.641) + (StreamingMovies_Yes * 51.8)
All of the variables represent paid services; the coefficients likely correspond to the average monthly payment for each of these services:
(Note: the coefficient values differ very slightly between runs because the models were regenerated.)
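As a sanity check, the printed equation can be applied by hand; a sketch predicting the monthly charge of a hypothetical customer with fiber-optic internet, multiple lines, and streaming movies (coefficients copied from the final unstandardized model above, rounded):

```python
# coefficients copied from the final unstandardized model above (rounded)
coefs = {
    'const': 83.839, 'Churn_Yes': 2.308,
    'InternetService_Fiber Optic': 20.067, 'InternetService_None': -12.678,
    'Multiple_Yes': 32.394, 'OnlineSecurity_Yes': 2.566,
    'OnlineBackup_Yes': 22.381, 'DeviceProtection_Yes': 12.438,
    'TechSupport_Yes': 12.558, 'StreamingTV_Yes': 41.641,
    'StreamingMovies_Yes': 51.8,
}

# hypothetical customer: fiber-optic internet + multiple lines + streaming movies
customer = {'InternetService_Fiber Optic': 1, 'Multiple_Yes': 1, 'StreamingMovies_Yes': 1}

prediction = coefs['const'] + sum(coefs[k] * v for k, v in customer.items())
print(round(prediction, 2))  # 188.1
```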
I. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.
J. Demonstrate professional communication in the content and presentation of your submission.
Massaron, L., & Boschetti, A. (2016). Regression analysis with Python. Packt Publishing. ISBN: 9781785286315
Bruce, P. C., Bruce, A., & Gedeck, P. (2020). Practical statistics for data scientists: 50 essential concepts. Sebastopol, CA: O'Reilly Media, Incorporated.
Western Governors University (WGU). Predictive Modeling – D208 course materials and labs.
General questions and answers from https://stackoverflow.com/
Documentation of the Python packages pandas, matplotlib, numpy, seaborn, sklearn, and scipy.
https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/
https://www.youtube.com/watch?v=0MFpOQRY0rw&list=PLTNMv857s9WUI1Nz4SssXDKAELESXz-bi&index=7